FPGA implementation of a 32x32 autocorrelator array for analysis of fast 

With the evolving technology in CMOS integration, new classes of 2D-imaging detectors have recently become 
available. In particular, single photon avalanche diode (SPAD) arrays allow detection of single photons at high 
acquisition rates (> 100 kfps), which is about two orders of magnitude higher than with currently available 
cameras. Here we demonstrate the use of a SPAD array for imaging fluorescence correlation spectroscopy 
(imFCS), a tool to create 2D maps of the dynamics of fluorescent molecules inside living cells. Time-dependent 
fluorescence fluctuations, due to fluorophores entering and leaving the observed pixels, are evaluated by means 
of autocorrelation analysis. The multi-τ correlation algorithm is an appropriate choice, as it does not rely 
on the full data set being held in memory. Thus, this algorithm can be efficiently implemented in custom 
logic. We describe a new implementation for massively parallel multi-τ correlation hardware. Our current 
implementation can calculate 1024 correlation functions at a resolution of 10 µs in real time and can therefore 
correlate real-time image streams from high-speed single photon cameras with thousands of pixels. 

PACS numbers: 42.50.Ar, 42.62.Fi, 78.47.je 

Keywords: Correlator; FPGA; SPIM; SPAD; FCS; Photon counting and statistics 



I. INTRODUCTION 

Fluorescence correlation spectroscopy (FCS) is a 
powerful experimental technique for measuring the dynamics 
of fluorescently labeled molecules in solution and 
also inside living cells. It allows one to determine the particle 
number, the diffusion coefficient, flow speeds and 
also photophysical and chemical reaction rates (for an 
overview, see Ref. 3). In FCS the time trace of the fluorescence 
intensity fluctuations I(t) inside a small observation 
volume (usually around 10^-15 l = 1 µm^3) is monitored. 
The fluctuations originate from particles entering 
and leaving the focus, or from transitions between states having 
different quantum yields. Faster dynamics of the fluorescing 
particles also lead to faster fluctuations, which 
can be quantified by means of a temporal first-order 
autocorrelation function (ACF): 

g(\tau) = \frac{\langle I(t) \cdot I(t+\tau) \rangle_t}{\langle I(t) \rangle_t^2}, \qquad \langle \,\cdot\, \rangle_t = \lim_{T \to \infty} \frac{1}{T} \int_0^T \cdot \; dt    (1) 

The ACF usually contains features that are spread over 
several orders of magnitude in time (nanoseconds to seconds). 

The standard FCS setup uses a confocal microscope 
in combination with single-photon sensitive detectors to 
acquire the fluorescence time trace I(t) from one focal 
volume. Then the data are fed into a "correlator" (hardware 
or software component), which estimates the ACF 
over a certain dynamic range. 

In the accompanying paper by Mocsar et al. (Ref. 4), a field 
programmable gate array (FPGA) implementation of 
such a correlator circuit for use with a confocal microscope 
and up to two single photon avalanche diode 
(SPAD) detectors is presented. In recent years, the availability 
of fast cameras has triggered the development of 
different imaging modalities for FCS, which allow spatial 
mapping of the diffusion coefficient and other dynamic 
properties by calculating an ACF for each of potentially 
many pixels. This requires correlation hardware 
that can process the data stream very fast, ideally in real 
time, for all pixels of the image sensor. Here we extend 
the idea of hardware reuse, presented before (Ref. 4), for application 
to imaging FCS. 

Our FPGA-based implementation can calculate 1024 
autocorrelation functions in parallel and in real time. 
The dynamic range of the ACFs is τ_min … τ_max = 
10 µs … 1 s. As a data acquisition device we use the single 
photon avalanche diode array (SPAD array) Radhard2. 
From our sensor we read a 32 × 32 pixel frame every 
Δt_frame = 10 µs, where each pixel contains 1 bit of information 
(no photon or at least one photon in the last 
Δt_frame). The design presented here could easily be 
adapted to other image sensors such as customized complementary 
metal oxide semiconductor (CMOS) or electron 
multiplying charge-coupled device (EMCCD) cameras. 



II. MULTI-τ HARDWARE CORRELATORS 

A hardware correlator estimates the ACF in Eq. 1 from 
a finite sequence of intensity measurements 

I_n = \int_0^{\tau_{min}} I(n \cdot \tau_{min} + t) \, dt, \qquad n = 0, 1, \dots, T-1    (2) 

with the number of samples T and the integration time 
τ_min for one sample. When discretizing Eq. 1 with this 
intensity sequence, care has to be taken not to bias the 
normalization 1/⟨I⟩^2. A viable choice is the "symmetric 
normalization" introduced in Ref. 10: 

g_{sym}(\tau_k) = \frac{\frac{1}{T-\tau_k} \sum_{n=\tau_k}^{T-1} I_n \cdot I_{n-\tau_k}}{\left[ \frac{1}{T} \sum_{n=0}^{T-1} I_n \right] \cdot \left[ \frac{1}{T-\tau_k} \sum_{n=\tau_k}^{T-1} I_{n-\tau_k} \right]}    (3) 

with a given set of lag times τ_k ∈ ℕ (in units of τ_min, so 
τ = τ_k · τ_min). When the full sequence {I_n}, n = 0 … T−1, is 
available after the measurement, Eq. 3 may be evaluated 
directly for an arbitrary (also logarithmically spaced) set 
of lags τ_k. This gives an unbiased estimation of the ACF 
("direct correlation"). 
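The direct correlation described above can be sketched in a few lines of Python. This is only an illustration, not the authors' software: the function name `g_sym_direct` is hypothetical, and the normalization (mean of all samples times mean of the delayed samples) is an assumed reading of the symmetric normalization.

```python
def g_sym_direct(samples, lags):
    """Direct estimate of the normalized ACF for integer lags (units of tau_min).

    Assumption: the denominator is the mean of all samples multiplied by the
    mean of the delayed samples that actually enter the lag-k products.
    """
    T = len(samples)
    total_mean = sum(samples) / T            # "direct" monitor, all T samples
    result = {}
    for k in lags:
        if not 0 <= k < T:
            raise ValueError("lag must satisfy 0 <= lag < T")
        n_pairs = T - k                      # number of products available at lag k
        # mean of I_n * I_{n-k} over all available pairs
        corr = sum(samples[n] * samples[n - k] for n in range(k, T)) / n_pairs
        # mean of the delayed samples I_{n-k}
        delayed_mean = sum(samples[n - k] for n in range(k, T)) / n_pairs
        result[k] = corr / (total_mean * delayed_mean)
    return result


if __name__ == "__main__":
    # A constant signal carries no fluctuations, so the estimate is 1 at every lag.
    print(g_sym_direct([3] * 10, [0, 1, 5]))
```

Because every lag loops over the whole sequence, this approach needs the full data set in memory, which is exactly what the multi-τ scheme avoids.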

To implement our hardware correlator, we use the 
multi-τ scheme introduced in Ref. 11, which is also illustrated 
and compared to a linear implementation in 
FIG. 1 (for a detailed description, please refer to our accompanying 
paper Ref. 4). The multi-τ scheme uses a 
set of S "linear" correlator blocks (FIG. 1(b,c)). The input 
samples I_{s,n} (n is the same index as in Eq. 2) are 
summed over increasingly long periods Δτ_s = m^s: 

I_{s,n} = \sum_{k=1}^{m^s} I_{n-k}, \qquad \text{for } s > 0    (4) 

with I_{0,n} = I_n. 

Each of the linear correlators estimates the ACF at P 
linearly spaced lags 

\tau_{0,0} = 0 
\tau_{s,0} = \tau_{s-1,P-1} + m^{s-1} 
\tau_{s,p} = \tau_{s,0} + p \cdot m^{s}    (5) 

where p = 0 … P − 1. 
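To make the quasi-logarithmic lag spacing concrete, the following Python sketch lists the lags of all blocks. It rests on an assumption that is not spelled out in the text: the first lag of block s is taken to follow the last lag of block s − 1 by one spacing of block s − 1, so that the lag channels tile the time axis without gaps. The function name `multi_tau_lags` is hypothetical.

```python
def multi_tau_lags(S, P, m=2):
    """Return the lags (in units of tau_min) of S linear blocks with P channels each.

    Assumption: block s uses the spacing m**s, and its first lag continues one
    block-(s-1) spacing after the last lag of block s-1.
    """
    lags = []
    for s in range(S):
        spacing = m ** s
        first = 0 if s == 0 else lags[-1] + m ** (s - 1)
        lags.extend(first + p * spacing for p in range(P))
    return lags


if __name__ == "__main__":
    lags = multi_tau_lags(S=14, P=8)
    tau_min = 10e-6                       # 10 microseconds per sample
    print(len(lags), "channels, longest lag", lags[-1] * tau_min, "s")
```

Under this assumption, S = 14 blocks of P = 8 channels at τ_min = 10 µs reach a longest lag of about 1.2 s, in line with the dynamic range of roughly 10 µs to 1 s quoted in the introduction.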

In summary, this results in a quasi-logarithmic spacing 
of estimates g_{sym,multi-τ}(τ_{s,p}). The advantage of this 
multi-τ scheme is its simple implementation in hardware 
and a large dynamic time range with a reasonable number 
of channels. Its disadvantage is a systematic error 
introduced by averaging: As shown in Ref. 12, the estimator 
g_{sym,multi-τ}(τ_{s,p}) equals the ideal correlation function 
g(τ_{s,p} · τ_min) (see Eq. 3) convolved with a triangular 
kernel with width m^s: 

g_{sym,multi-\tau}(\tau_{s,p}) = g(\tau_{s,p} \cdot \tau_{min}) * \Lambda(\tau_{s,p}, m^s),    (6) 

where * denotes the convolution product and Λ(τ, Δτ) = 
Δτ − |τ| for |τ| < Δτ and Λ(τ, Δτ) = 0 for |τ| ≥ Δτ is 
the triangular shaped kernel. 



III. HARDWARE DESIGN 

Here we describe how the hardware reuse scheme in 
the accompanying paper Ref. 4 can be extended to ac- 
commodate many more input channels. Instead of the 
graphical tool used there, here we employed a low-level 
hardware description language to gain speed and flexibil- 
ity. This enables us to fine-tune many parameters of the 
final design, e.g. operational speed, memory usage, logic 
resource consumption and routing between logic cells. 



A. Single-Pixel Correlator 

As shown in FIG. 1, a typical correlator is made up 
from channels, each corresponding to a certain lag time 
and consisting of a multiplier, an accumulator and a delay 
element. 

The idea of our implementation is to use one single 
channel circuit to calculate all channels within one 
multi-τ correlator. This is possible by serial processing 
of the lag time channels, since every channel's hardware 
is identical. 

The basic arithmetic operation of one channel is to 
multiply and accumulate (MAC). Therefore we can map 
its functionality onto a MAC unit which can be found 
on most FPGA architectures and which is considerably 
faster than using generic FPGA logic cells. Only about 
100 of these can be found on typical FPGAs, precluding 
any approach with blocks consisting of several lag chan- 
nels. 

As only one circuit is used to process all channels, we 
use an internal memory block (block random access mem- 
ory, BRAM) to store their state (i.e., the content of the 
accumulator and the delayed signal). We implement this 
circuit by decomposing it into five steps: 

1. Load accumulator and delayed value of a channel 
from memory 

2. Wait for memory access to complete 

3. Multiply delayed with global signal 

4. Add multiplication result to channel's accumulator 

5. Store counter and new delayed value to memory 

To increase performance, these steps are executed in 
an interleaved manner for four channels simultaneously. 
This "pipelining" scheme is shown in TAB. I. Our five 
pipeline steps are compatible with the internal pipelin- 
ing of common MAC units. 
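The serial processing idea can be illustrated with a small behavioural model in Python: a single multiply-accumulate step is applied to one channel after another, and the channel states live in a list that stands in for the BRAM. This is a sketch under assumptions, not the VHDL design: the names `process_sample` and `linear_correlator` are hypothetical, the pipeline interleaving is ignored, and the delay chain is assumed to shift before the product is formed (so channel 0 has lag 0).

```python
def process_sample(memory, global_sample, local_sample):
    """Run one input sample through all channels with a single MAC operation.

    memory[p] holds (accumulator, delayed value) for channel p.
    Returns the value leaving the delay chain of this block.
    """
    incoming = local_sample                    # value entering the delay chain
    for p in range(len(memory)):
        acc, delayed = memory[p]               # load state (steps 1 and 2)
        new_delayed = incoming                 # shift the delay chain by one element
        acc += new_delayed * global_sample     # multiply and accumulate (steps 3 and 4)
        memory[p] = (acc, new_delayed)         # store state (step 5)
        incoming = delayed                     # old value moves on to channel p + 1
    return incoming


def linear_correlator(samples, P):
    """Autocorrelate a sequence with P linearly spaced lag channels (lags 0..P-1)."""
    memory = [(0, 0) for _ in range(P)]
    for x in samples:
        process_sample(memory, x, x)           # ACF: global and local input are identical
    return [acc for acc, _ in memory]


if __name__ == "__main__":
    # Raw (unnormalized) sums of I_n * I_{n-p} for p = 0, 1, 2
    print(linear_correlator([1, 2, 3, 4], P=3))
```

Since the loop body is identical for every channel, one arithmetic unit plus a small state memory is sufficient, which is the property the hardware design exploits.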



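TABLE I. Interleaved pipeline of the linear correlator design 
with 8 channels (L: load, W: wait, M: multiply, A: add, S: store). 
Channels 0 to 3 are processed in cycles 0 to 7, channels 4 to 7 in cycles 8 to 15. 

cycle c  |  0  1  2  3  4  5  6  7 |  8  9 10 11 12 13 14 15 |
ch. 0    |  L  W  M  A  S          |  L  W  M  A  S          | ch. 4
ch. 1    |     L  W  M  A  S       |     L  W  M  A  S       | ch. 5
ch. 2    |        L  W  M  A  S    |        L  W  M  A  S    | ch. 6
ch. 3    |           L  W  M  A  S |           L  W  M  A  S | ch. 7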



From one block to the next the delay time is doubled 
(m = 2) in the multi-τ scheme, making the input data 
rate of block s + 1 half that of block s. Hence we need to 
execute each block only half as often as its predecessor. 
Thus, a complete multi-τ correlator can be executed in 
only twice the run-time Δt_lin of a single linear correlator 
block (see Eq. 7). 



TABLE II. Binary representation of counter c that solves relations 
Eq. 8 for a given linear correlator block s. * denotes 
a don't care condition. 

             |  binary representation of counter c that solves relation Eq. 8
lin. corr. s | higher bits | 2^7 2^6 2^5 2^4 | 2^3 2^2 2^1 2^0
0            |      *      |  *   *   *   *  |  *   *   *   0
1            |      *      |  *   *   *   *  |  *   *   1   1
2            |      *      |  *   *   *   *  |  *   0   0   1
3            |      *      |  *   *   *   *  |  0   1   0   1
4            |      *      |  *   *   *   0  |  1   1   0   1
5            |      *      |  *   *   0   1  |  1   1   0   1
6            |      *      |  *   0   1   1  |  1   1   0   1
7            |      *      |  0   1   1   1  |  1   1   0   1


The linear correlator as implemented here, together 
with its scheduler and the summation logic, is called 
correlation processing element (CorrPE). All channel data 
and intermediate summation results, the so-called pixel 
context, are stored in a dual-port BRAM which is associated 
with the CorrPE. 



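\underbrace{\Delta t_{lin}}_{\text{1st lin. corr.}} + \underbrace{\tfrac{1}{2} \Delta t_{lin}}_{\text{2nd lin. corr.}} + \underbrace{\tfrac{1}{4} \Delta t_{lin}}_{\text{3rd lin. corr.}} + \dots \le \sum_{n=0}^{\infty} \frac{\Delta t_{lin}}{2^n} = 2 \cdot \Delta t_{lin}    (7) 

A key requirement is that Δt_lin is at most half the integration 
time τ_min of the input signal I_n. 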

As before (Ref. 4), a scheduler guarantees that a linear correlator 
block s is only executed after its predecessor s − 1 
has been executed twice. A counter c = 0, 1, … is incremented 
with every execution of any linear correlator 
block. The scheduler uses the following relations to determine 
which linear correlator block s has to be executed 
at a given counter value c (for details see the appendix): 

s = 0:   c mod 2^1 = 0 
s = 1:   c mod 2^2 = 3    (8) 
s ≥ 2:   c mod 2^{s+1} = 2^s − 3 

FIG. 2(a) shows the solution of this relation for c values 
from 0 to 31. In the binary representation of c for a linear 
correlator block s, patterns are evident that can be used to 
implement the scheduler efficiently. As shown in TAB. II 
(for S = 8 linear correlators), correlator block s = 0 is 
executed whenever the last bit of c is 0b, correlator s = 1 
is executed when the last two bits are 11b, and so forth. 
This scheme uses only simple comparison operations. 
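The scheduling rule can be checked with a short Python sketch that compares the low bits of the counter against one fixed pattern per block. The function name `scheduled_block` is hypothetical, and the masks below are simply the modulo relations written as bit comparisons; the hardware realization may differ in detail.

```python
def scheduled_block(c, S):
    """Return the linear correlator block to run at counter value c, or None.

    Block 0 runs on every even c, block 1 when the two lowest bits are 11b,
    and block s >= 2 when the lowest s+1 bits equal 2**s - 3.
    """
    if (c & 0b1) == 0b0:
        return 0
    if S > 1 and (c & 0b11) == 0b11:
        return 1
    for s in range(2, S):
        mask = (1 << (s + 1)) - 1            # selects the lowest s+1 bits
        if (c & mask) == (1 << s) - 3:
            return s
    return None                              # idle cycle (no block of this correlator)


if __name__ == "__main__":
    print([scheduled_block(c, 8) for c in range(16)])
```

Each counter value selects at most one block, and block s is selected half as often as block s − 1, so the total work per input sample stays bounded.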

Between two consecutive blocks s − 1 and s, adder circuitry 
is inserted to sum up two subsequent input signal 
values I_{s−1,n−1} and I_{s−1,n}. This is done for both the 
delayed/local and the undelayed/global signal, while 
they are processed in the pipeline. 



B. Multi-Pixel Correlator 

The CorrPE described above is much faster than re- 
quired to calculate the ACF for a single pixel in the SPAD 
array. Hence, we can reuse a single CorrPE for mul- 
tiple pixels by switching between pixel contexts. This 
"pixel scheduler" uses a double-buffering strategy. While 
a CorrPE operates on the current context, the previous 
context is exchanged with the next context to be pro- 
cessed. The pixel contexts are stored in external back- 
ground memory (SRAM), because they exceed the capac- 
ity of the internal memory. In FIG. 3 we illustrate how 
we reuse one CorrPE for several pixels in comparison to 
a naive implementation using one CorrPE per pixel. 

In addition to the channel data, the cycle counter c and 
an accumulator for the local and the global input signals 
have to be saved. The latter are used for normalization 
and cross-correlation. 

A single CorrPE is used to process an entire column 
(ACFs of n_y = 32 pixels) of our SPAD array. To handle 
the full n_x × n_y array of pixels, we instantiate n_x = 32 
CorrPEs in parallel. An overview of this scheme is shown 
in FIG. 4. A "data acquisition" circuit communicates 
with the SPAD array and provides the image data for 
the correlators. As the image data are streamed out row 
by row, and each of the n_x CorrPEs is only processing 
data from one specific context (i.e. a specific row), the 
remaining pixels have to be buffered in 32 FIFOs (first-in 
first-out memory buffers) located in external RAM. 

In addition our design contains two USB 2.0 interfaces. 
One is used to send the raw data stream from the SPAD 
array to the computer, which allows further data process- 
ing. We also use these raw data to verify our correlator 
design. Via the second interface, intermediate and final 
results from the correlators are transferred to the host 
computer. The intermediate results allow implementing 
a live view of the ongoing calculation of the ACFs. 



C. ACF Normalization 

The intensity values in the denominator of Eq. 3 are 
typically obtained from monitor channels M_{τ_k}, which accumulate 
the total photon count at a given lag time τ_k. 
In contrast to other implementations, we use only one 
monitor M_0 (input signal accumulator) per multi-τ correlator. 
Symmetric normalization (see also Eq. 3) yields 
the following: 

g_{sym,multi-\tau}(\tau_{s,p}) = \frac{T \cdot G_{\tau_{s,p}}}{M_0 \cdot M_{\tau_{s,p}}}    (9) 

After measuring T samples, the number of samples 
that have propagated through the correlator to a distinct 
channel (s,p) is T − τ_{s,p}. Under the assumption that I_n is 
a stationary random process and that T > τ_{s,p}, the per-channel 
monitors M_{τ_{s,p}} can be estimated in the following 
manner: 

M_{\tau_{s,p}} \approx M_0 \cdot \frac{T - \tau_{s,p}}{T}    (10) 

This normalization and the model fitting are done on the host 
computer, since floating-point arithmetic cannot be implemented 
efficiently on an FPGA. Although the correlation 
on the FPGA and the normalization on the host 
computer can easily be done in real time, the fits in most 
cases cannot. Usually a single curve fit takes tens of milliseconds, 
thus fitting all 1024 ACFs would amount to 
> 10 s additional computing time; this still allows 
fit parameter maps (images) to be displayed within an acceptably 
short delay. In current imaging FCS software systems 
the data are loaded and correlated on the host computer, 
which for a measurement of typically 10 s takes > 10 min 
for 1024 pixels at the frame rate of our sensor. 
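As an illustration of the host-side normalization with a single monitor, the Python sketch below estimates the per-lag monitor from the input accumulator and the fraction of samples that have reached the channel. It is a sketch under assumptions: the function name `normalize_single_monitor` is hypothetical, G and M_0 are taken to be raw sums in units of the first block (any scaling introduced by the block-wise summation is ignored), and the normalized value is taken to be T·G divided by the product of both monitors.

```python
def normalize_single_monitor(G, M0, T, lag):
    """Normalize a raw correlation sum G using only the input monitor M0.

    G   : accumulated sum of products for this lag channel
    M0  : accumulated sum of all T input samples
    T   : number of input samples measured so far
    lag : lag of the channel in units of the sample time (0 <= lag < T)
    """
    if not 0 <= lag < T:
        raise ValueError("lag must satisfy 0 <= lag < T")
    if M0 == 0:
        raise ValueError("no photons counted, ACF undefined")
    # Stationarity assumption: the delayed monitor scales with the fraction
    # of samples that have already propagated to this channel.
    M_tau = M0 * (T - lag) / T
    return T * G / (M0 * M_tau)


if __name__ == "__main__":
    # Constant input of 3 counts per sample, T = 100, lag = 10:
    # G = 90 * 3 * 3 and M0 = 100 * 3, so the normalized value is 1.
    print(normalize_single_monitor(G=810, M0=300, T=100, lag=10))
```

Only integer accumulators are needed on the device; the single division per channel is deferred to the host.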

To show that the estimation in Eq. 10 yields good 
results, we simulated different correlator types in software. 
The results for a direct estimation of the ACF 
using Eq. 3 (green), a multi-τ correlator with a monitor 
channel per lag (blue) and our estimation (magenta) 
can be seen in FIG. 5, where the data in (a) 
and (b) were obtained by correlating the input signal 
I(t) = 1 + sin(2πt/(1.51 · 10^-4)), for which the exact ACF 
is known to be g^(theoretical)(τ) = 1 + cos(2πτ/(1.51 · 10^-4)) 
(time t and lags τ are unit free). The data in FIG. 5(c) 
were created by simulating a T_sim = 1 s long FCS experiment 
with one diffusing species. They were computed 
with our FCS simulation software described in Ref. 15. 
Further details on the simulation code are given in the 
appendix. 

For short lags the estimated ACFs resemble the theoretical 
curves quite well. Multi-τ correlators have an 
increased absolute error for longer lags, which is due to 
the averaging described in Eq. 6. This can be seen especially 
in the case of the sine wave signal. The multi-τ 
estimates can still be used for FCS experiments, as here 
the ACFs usually decay to 1 (white noise) for large lag 
times, and thus the systematic error drops to zero again 
(for a detailed discussion of this, see e.g. Ref. 12). For 
τ ≳ T_sim/10, multi-τ correlators show additional systematic 
deviations from the theoretical curve and from the 
direct estimation, because the channels are not averaged 
over sufficiently many samples to yield reliable results. 
Here the multi-τ implementation with multiple monitor 
channels performs better due to the better estimation of 
the normalization factor M_τ. 



D. Crosscorrelation (CCF) 



Our design can also calculate cross-correlation: 

g^{(CCF)}(\tau) = \frac{\langle I^{(l)}(t) \cdot I^{(g)}(t+\tau) \rangle_t}{\langle I^{(l)}(t) \rangle_t \cdot \langle I^{(g)}(t) \rangle_t} 

between two input signals I^(l)(t) and I^(g)(t) at the local 
I^(l) and global I^(g) inputs of the CorrPE (see FIG. 1, 
where both signals are tied to I(t) for ACF calculation). 
This changes the multi-τ estimator Eq. 9 to: 

g^{(CCF)}_{sym,multi-\tau}(\tau_{s,p}) = \frac{T \cdot G_{\tau_{s,p}}}{M_0^{(g)} \cdot M_{\tau_{s,p}}^{(l)}}    (11) 

Here the normalization uses the monitors M_0^(g) for the 
global and M_{τ_{s,p}}^(l) for the local signal 
(M_{τ_{s,p}}^(l) ≈ M_0^(l) · (T − τ_{s,p})/T). 
For autocorrelation (see above) only one monitor 
channel is needed, as M_0 = M_0^(g) = M_0^(l). 



IV. PERFORMANCE & IMPLEMENTATION DETAILS 

The complete design is implemented in 
two Virtex-II Pro FPGAs (XC2VP40, Xilinx, 
http://www.xilinx.com/, San Jose, USA), on a 
LASP development board. One FPGA is used for data 
acquisition and line reordering, while the other one is 
used to implement the correlators. The total resource 
consumption within the second FPGA is around 80%. 
Using this hardware platform, there are currently P = 8 
channels within each of the S = 14 linear correlator 
blocks. 

In Eq. 7 we showed that the complete multi-τ correlator 
can be executed within twice the time needed 
for the first linear correlator block, so we devote half 
of the execution time to the first linear correlator and 
the rest to the remaining blocks. Therefore a new input 
sample can be accepted only once every 2Δt_lin. Here 
Δt_lin = 2P + 3 cycles is the time needed to process a new 
input sample I_n in the first linear correlator block. The 3 
additional cycles are used for data handover to the next 
block. 

data word   | usage
0 … 111     | delay register I^(l)_{s,p} and raw accumulator values
112         | status counter c
113 … 125   | intermediate results for accumulated input signals I_{s,n}
126         | global monitor channel M_0^(g)
127         | local monitor channel M_0^(l)

TABLE III. Memory layout of a pixel context 

The CorrPEs run with a clock frequency of 144 MHz, 
which corresponds to an execution time of 264 ns = 
2Δt_lin. This is the minimum timespan between two subsequent 
samples if no pixel multiplexing is used. Hence, 
our current FPGA platform can calculate 32 different 
ACFs or CCFs with a minimum lag time of 264 ns, which 
is comparable to the 100 ns designs presented in Ref. 17 
and more recently in Ref. 4. Trading time resolution for 
more correlation functions, we can process all 1024 pixels 
of the SPAD array at a frame rate of 100 kfps in real 
time. 
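The timing figures quoted above can be reproduced with a few lines of arithmetic. The Python sketch below uses only numbers given in the text (P = 8, 144 MHz, 32 pixels per CorrPE, 10 µs frame period); the function name `timing_budget` is hypothetical, and any overhead for switching pixel contexts is assumed to be hidden by the double buffering and is therefore ignored.

```python
def timing_budget(P=8, f_clk_hz=144e6, pixels_per_pe=32, frame_period_s=10e-6):
    """Return cycle counts and times for one multi-tau update and one column."""
    cycles_lin = 2 * P + 3                 # one linear block incl. 3 handover cycles
    cycles_full = 2 * cycles_lin           # whole multi-tau correlator (twice one block)
    t_full = cycles_full / f_clk_hz        # time per sample and pixel
    t_column = pixels_per_pe * t_full      # one CorrPE serving a full column
    return {
        "cycles_lin": cycles_lin,
        "cycles_full": cycles_full,
        "t_full_ns": t_full * 1e9,
        "t_column_us": t_column * 1e6,
        "fits_frame": t_column <= frame_period_s,
    }


if __name__ == "__main__":
    for key, value in timing_budget().items():
        print(key, value)
```

With these numbers one update takes 38 cycles, or about 264 ns, and 32 pixels need roughly 8.4 µs, which fits into the 10 µs frame period of the sensor.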

To estimate the memory consumption of our design, we 
first look at the pixel context, which consists of 128 words 
of 64 bits each. TAB. III shows a detailed memory layout. 
Sixteen of the upper 32 bits of the raw accumulators 
store the current value of the delay registers I^(l)_{s,p}. The 
lower 32 bits contain the accumulator G_{τ_{s,p}}. The monitor 
channels are 32 bits each. Data handover between consecutive 
blocks (accumulated local and global signals) is 
done via 16-bit-wide memory areas, which is sufficient for 
S < 16. Since in the later linear correlators the accumulated 
input signals I_{s,n} are multiplied, the sums G_{τ_{s,p}} and 
also the I_{s,n} can get relatively large and may not fit in 
the 32 bit memory locations available. However, for our 
FCS application we estimate that the word sizes used are 
sufficient, since the per-pixel input event rate is limited. 
For a deeper discussion of this topic, see Ref. 1^. 

For our 32 x 32 pixel SPAD array the pixel contexts 
are stored in 32 • 32 • 128 • 64 bits = 512 KBytes of external 
SRAM. A second SRAM stores the 256 KBytes used for 
the FIFOs (2048 entries each) that hold the pixel data 
until they are processed. 

The correlator design is implemented in VHDL (very 
high speed integrated circuit hardware description lan- 
guage). It can be configured via generics (a VHDL fea- 
ture). Thus we can reuse our design for different needs 
(e.g. for different sensor sizes and frame rates). We ex- 
pect our design to show a significant increase in perfor- 
mance when implemented on newer FPGA generations 
(e.g. Virtex 5 or Virtex 6 from Xilinx). 

To ensure the functional correctness of the designed 
correlator, a simulation of the hardware was tested using 
random input data. The outcome was compared to the 
results of a software implementation of the multi-τ correlator 
using the same data set. Both yielded exactly the 
same results. The comparison was also done using real 
experimental data from the SPAD array. Again, both 
results were identical. 

To demonstrate the functionality of the whole system, 
we tested the design using the SPAD array to record an 
LED connected to a sine wave generator set to 2.5 kHz. 
FIG. 6 shows the results of this measurement. A fit to 
the data recovered the chosen frequency. 



V. CONCLUSION 

In this paper we presented the implementation of an 
FPGA-based multi-τ correlator design that can calculate 
1024 correlation functions in real time at a minimum lag 
time of 10 µs. To our knowledge this is the largest number 
of real-time multi-τ correlators implemented so far 
in a single device. The minimum lag time of 10 µs in 
our design is considerably longer than that of currently 
available hardware correlators (e.g. from ALV GmbH, 
Langen, Germany or correlator.com, Bridgewater, USA 
and Ref. 19). However, those are limited to at most 32 
auto-correlators. 

We use our design to correlate the output of a single- 
photon avalanche diode array used as image sensor. This 
combination will allow us to perform imaging fluores- 
cence correlation spectroscopy at sufficiently fast time 
scales to resolve even the motion of small molecules in 
solution and living cells. 

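Our design is flexible, so besides estimating ACFs, temporal 
cross-correlation functions of different pixels or 
multiple colors (using spectrally resolved detectors) can 
also be calculated. 

Appendix A: Relation Eq. 8 

By definition, Eq. 8 can be rewritten in the following 
manner, with C_s containing the cycles in which the linear 
correlator s is processed: 

C_0 = \{ c = n \cdot 2^1 + 0 \mid c, n \in \mathbb{N}_0 \} 
C_1 = \{ c = n \cdot 2^2 + 3 \mid c, n \in \mathbb{N}_0 \} 
C_s = \{ c = n \cdot 2^{s+1} + 2^s - 3 \mid c, n \in \mathbb{N}_0 \} \quad \forall s \ge 2 

It remains to show that C_p ∩ C_q = ∅ ∀ p, q ∈ ℕ_0, p ≠ q. 
Obviously C_0 ∩ C_s = ∅ ∀ s ∈ ℕ, since C_0 contains only even 
and all other sets only odd numbers. For k, l ∈ ℕ_0: 

C_1 \cap C_2 = \emptyset \;\Leftarrow\; k \cdot 2^2 + 3 \ne l \cdot 2^3 + 1 

The left-hand side leaves the remainder 3 when divided by 4, the 
right-hand side the remainder 1. The same argument holds for 
every C_s with s ≥ 2, because 2^s − 3 always leaves the remainder 1. 

Without loss of generality let r < s; ∀ r, s ∈ ℕ_0; r, s ≥ 2; 
k, l ∈ ℕ_0: 

C_r \cap C_s = \emptyset \;\Leftarrow\; k \cdot 2^{r+1} + 2^r - 3 \ne l \cdot 2^{s+1} + 2^s - 3 \;\Leftrightarrow\; \underbrace{2k + 1}_{\text{odd}} \ne \underbrace{2^{s-r} \cdot (2l + 1)}_{\text{even}} 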

Appendix B: FCS Simulation & Fits 

To test different types of autocorrelators, we used our 
FCS simulation program described in Ref. 15. It simulates 
the 3D trajectories 

\vec{r}_i(t), \qquad t = 0 \dots T_{sim}, \quad i = 1 \dots K 

of a set of K fluorescing particles performing a random 
walk in which each 1D step is drawn from a zero-centered 
Gaussian distribution with width 

\sigma_{jump} = \sqrt{2 D \cdot \Delta t_{sim}}. 

Here D is the given diffusion coefficient and Δt_sim is the 
simulation time step. We use a Gaussian approximation 
for the focal illumination and detection volume: 

h(x, y, z) = \exp\left( -2 \cdot \frac{x^2 + y^2}{w_{xy}^2} - 2 \cdot \frac{z^2}{\gamma^2 w_{xy}^2} \right), 

where w_xy is the lateral width of the profile, γ = w_z/w_xy 
is its aspect ratio and w_z is its length. The program 
then estimates the expected number of photons N_phot(t) 
detected during each simulation time step [t, t + Δt_sim]: 

N_{phot}(t) = N_0 \cdot \Delta t_{sim} \cdot \sum_{i=1}^{K} q_f \cdot q_{det} \cdot h(\vec{r}_i(t)). 


symbol  | parameter                              | value
T_sim   | duration of simulation                 | 1 s
Δt_sim  | simulation time step                   | 1 µs
K       | number of walkers                      | 1576
        | simulation volume                      | 524 µm^3
C       | concentration of simulated particles   | 5 nM
D       | diffusion coefficient                  | 20 µm^2/s
N_0     | absorbed photons per molecule          | 4.2 · 10^…
w_xy    | lateral width of excitation profile    | 0.325 µm
q_f     | fluorescence quantum yield             | 0.8
q_det   | detection efficiency                   | 0.5
γ       | focal aspect ratio                     | 6
S       | number of linear correlator blocks     | 20
P       | number of channels per block           | 16
m       | lag time ratio between blocks          | 2

TABLE IV. Simulation parameters for the FCS simulation 
yielding the results displayed in FIG. 5. 



Here N_0 is the maximum number of detected photons per 
fluorophore and time step, while q_f and q_det are the quantum 
efficiencies of fluorescence and detection. To account 
for the counting statistics in the SPADs, the number of 
photons actually detected during a simulation 
time step is calculated by drawing a random number from 
a Poissonian distribution with average N_phot(t). 
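One time step of such a simulation can be sketched with the Python standard library alone. This is not the simulation program of Ref. 15, only an illustration under assumptions: all function names are hypothetical, `n0` is treated as a photon rate that is multiplied by the time step, the detection profile is assumed to be an anisotropic Gaussian with 1/e^2 widths w_xy and γ·w_xy, and the Poisson sampler is Knuth's simple multiplication method, which is adequate only for small averages.

```python
import math
import random


def jump_sigma(D, dt):
    """Width of the Gaussian 1D step distribution, sqrt(2 * D * dt)."""
    return math.sqrt(2.0 * D * dt)


def detection_profile(x, y, z, w_xy, gamma):
    """Assumed Gaussian detection profile with lateral width w_xy and length gamma * w_xy."""
    w_z = gamma * w_xy
    return math.exp(-2.0 * (x * x + y * y) / w_xy ** 2 - 2.0 * z * z / w_z ** 2)


def poisson(mean, rng):
    """Draw a Poisson-distributed integer (Knuth's method, for small means)."""
    limit = math.exp(-mean)
    k = 0
    prod = rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate_step(positions, D, dt, n0, q_f, q_det, w_xy, gamma, rng):
    """Move all particles by one random-walk step and count detected photons.

    positions is a list of [x, y, z] lists (micrometers) that is updated in place.
    Returns (expected photon number, Poisson-distributed detected photon number).
    """
    sigma = jump_sigma(D, dt)
    expected = 0.0
    for r in positions:
        for axis in range(3):
            r[axis] += rng.gauss(0.0, sigma)       # independent 1D Gaussian steps
        expected += q_f * q_det * detection_profile(r[0], r[1], r[2], w_xy, gamma)
    expected *= n0 * dt
    return expected, poisson(expected, rng)


if __name__ == "__main__":
    rng = random.Random(42)
    walkers = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(5)]
    print(simulate_step(walkers, D=20.0, dt=1e-6, n0=1e5, q_f=0.8, q_det=0.5,
                        w_xy=0.325, gamma=6.0, rng=rng))
```

Repeating this step and recording the detected counts yields a photon time trace that can be fed to the correlator models.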

The time series created by the simulation is then fed 
into the three different autocorrelator software implementations: 
(a) direct estimation, (b) multi-τ with one 
monitor channel and (c) multi-τ with multiple monitor 
channels. The latter two (b, c) simulate the complete 
structure of a multi-τ correlator. In correlator (b) we use 
one single monitor channel, which counts the incoming 
photons only and then uses Eq. 10 for the normalization. 
In correlator (c) we use one monitor channel per correlator 
lag, which counts the photons actually processed 
by each lag. 

The simulation parameters for the data in FIG. 5 are 
summarized in TAB. IV. The average photon count rate 
in the simulations was ⟨I(t)⟩ ≈ 157.4 kHz, which is (due 
to the chosen detection efficiency q_det = 0.5) about a 
factor of 5 higher than what would be expected from a 
standard confocal FCS setup. This way the signal to 
noise ratio is high enough to easily visualize artifacts introduced 
by the correlator implementation. FIG. 7 shows 
an example of a time trace N_phot(t)/Δt_sim generated by 
our simulation. 

The resulting ACFs were fitted to a standard 3D FCS 
model function: 

g(\tau) = 1 + \frac{1}{N} \cdot \left( 1 + \frac{\tau}{\tau_D} \right)^{-1} \cdot \left( 1 + \frac{\tau}{\gamma^2 \tau_D} \right)^{-1/2} 

with the average particle number N in the effective focal 
volume V_eff = π^(3/2) · γ · w_xy^3 and the diffusion decay 
time τ_D = w_xy^2/(4D). The fits were performed using a 
Levenberg-Marquardt least squares fitting routine. 
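For orientation, the quantities entering such a fit can be evaluated with a short Python sketch. It assumes the commonly used model for free 3D diffusion through a Gaussian focus (an assumption about the exact model form); the function names are hypothetical, and lengths are in micrometers, D in µm^2/s and times in seconds.

```python
import math


def diffusion_time(w_xy, D):
    """Diffusion decay time tau_D = w_xy^2 / (4 D)."""
    return w_xy ** 2 / (4.0 * D)


def effective_volume(w_xy, gamma):
    """Effective focal volume V_eff = pi^(3/2) * gamma * w_xy^3 (in um^3 = fl)."""
    return math.pi ** 1.5 * gamma * w_xy ** 3


def acf_3d_diffusion(tau, N, tau_D, gamma):
    """Normalized ACF for free 3D diffusion (assumed standard model)."""
    lateral = 1.0 / (1.0 + tau / tau_D)
    axial = 1.0 / math.sqrt(1.0 + tau / (gamma ** 2 * tau_D))
    return 1.0 + lateral * axial / N


if __name__ == "__main__":
    tau_D = diffusion_time(w_xy=0.325, D=20.0)
    V_eff = effective_volume(w_xy=0.325, gamma=6.0)
    print("tau_D =", tau_D, "s, V_eff =", V_eff, "um^3")
    for tau in (1e-5, 1e-4, 1e-3, 1e-2, 1e-1):
        print(tau, acf_3d_diffusion(tau, N=3.0, tau_D=tau_D, gamma=6.0))
```

With the simulation parameters of TAB. IV this gives a diffusion time of about 1.3 ms, so the decay of the ACF lies well inside the lag range covered by the correlator.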
FIG. 1. Hardware design of a linear correlator (a) in comparison with a multi-τ correlator (b). Panel (c) shows a schematic 
view of the multi-τ correlator, where the linear correlator building blocks are summarized by a single block. Corresponding 
channels are grouped with the same color. Global/undelayed (g) and local/delayed (l) inputs are located on the left. For 
autocorrelation, the global and local signal inputs to the 0-th block (I_n^(g) = I_n^(l) = I_n) are identical. The Δτ blocks represent 
delay elements, the ⊗ blocks multipliers and the Σ blocks accumulators. 
FIG. 2. Block scheduling algorithm. The counter c in the first row defines the current cycle. The second row shows the block 
s that is processed in the corresponding cycle. Further below the order of processing is shown: When a block s has been 
processed twice, the following block, s + 1, can process the sum of the two preceding s-blocks. 

[Figure 3 panels: (a) naive implementation, one CorrPE per pixel (3x3 SPAD array, 9x CorrPE); (b) optimized implementation, 
one CorrPE per column (3x3 SPAD array, line buffer, multiplexer / line resorter, 3x CorrPE)] 

FIG. 3. Comparison of a naive implementation (a: one correlator per pixel) and an optimized implementation (b: reuse 
CorrPEs for several pixels) of a multi-pixel multi-τ correlator for a 3 x 3 SPAD array. 
FIG. 4. System layout and data path, from data acquisition to correlation. One USB interface is used to stream the raw 
images, the other for streaming (intermediate) results. The first level cache (L1, double buffered) is used to hold the context 
of the currently processed pixel. FIFOs for row resorting and for context storage use external memory. 




FIG. 5. Simulation results for different implementations of multi-τ correlators: The left panels (a,b) show simulations for a sine 
wave input signal I(t) = 1 + sin(2πt/(0.151 ms)). The simulations on the right (c,d) were created using an FCS simulation. The 
top graphs (a,c) are estimates of the autocorrelation function using direct correlation from Eq. 3 (green), a multi-τ correlator 
with one monitor channel per lag time (blue) and our estimated normalization from Eq. 10 (magenta). Graph (a) also shows 
the theoretical ACF g^(theoretical)(τ) = 1 + cos(2πτ/(0.151 ms)) for the sine signal (light red). The lower graphs (b,d) show the 
absolute deviation of the estimates from g^(theoretical)(τ) (b) or from a fit to the curves (d). The parameters resulting from the 
fit in (c) are the same within < 2.5% for all three estimates. 
FIG. 6. Distribution of 992 (gray) ACFs taken by our sensor 
(first 31 columns only), exposed to a 630 nm LED sine-modulated 
with a frequency of 2.5 kHz. The correlator was 
running for 1.2 s (131072 samples at τ_min = 10 µs). The 
gray curves with significantly lower amplitude are due to 
hot pixels of the SPAD array; since these SPADs fire randomly, 
the corresponding correlation amplitude is decreased. 
The median, which tends to be less sensitive towards outliers 
than the mean, is shown in red. The theoretical model 
g^(theoretical)(τ) = 1 + A · cos(2πf · τ) is fitted against the median 
in the interval [10 µs, 1 ms], and is shown until τ = 2 ms. 
The fit yields a frequency of (2502 ± 4) Hz. The fanning out 
of the curve at high values of τ is due to unfilled correlator 
channels. Compare also FIG. 5(a). 




FIG. 7. Typical photon count time trace N_phot(t)/Δt_sim generated 
by our simulation. 



